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Abstract. We define and discuss tiie first sparse coding algorithm based on closed- 
form EM updates and continuous latent variables. The underlying generative 
model consists of a standard 'spike-and-slab' prior and a Gaussian noise model. 
Closed-form solutions for E- and M-step equations are derived by generalizing 
probabilistic PCA. The resulting EM algorithm can take all modes of a poten- 
tially multi-modal posterior into account. The computational cost of the algo- 
rithm scales exponentially with the number of hidden dimensions. However, with 
current computational resources, it is still possible to efficiently learn model pa- 
rameters for medium-scale problems. Thus the model can be applied to the typical 
range of source separation tasks. In numerical experiments on artificial data we 
verify likelihood maximization and show that the derived algorithm recovers the 
sparse directions of standard sparse coding distributions. On source separation 
benchmarks comprised of realistic data we show that the algorithm is competi- 
tive with other recent methods. 



1 Introduction 

Probabilistic generative models are a standard approach to model data distributions 
and to infer instructive information about the data generating process. Methods like 
principle component analysis, factor analysis, or sparse coding (SC) (e.g., [14|) have all 
been formulated in the form of probabilistic generative models. Moreover, independent 
component analysis (ICA), which is a very popular approach to blind source separation, 
can also be recovered from sparse coding in the limit of zero observation noise (e.g., 

HI). 

A standard procedure to optimize parameters in generative models is the application 
of Expectation Maximization (EM) (e.g., |13|). However, for many generative models 
the optimization using EM is analytically intractable. For stationary data only the most 
elementary models such as mixture models and factor analysis (which contains proba- 
bilistic PCA as special case) have closed-form solutions for E- and M-step equations. 
EM for more elaborate models requires approximations. In particular, sparse coding 
models (| 14'9'161 and many more) require approximations because integrals over the 
latent variables do not have closed-form solutions. 
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In this work we study a generative model that combines the Gaussian prior of prob- 
abilistic PCA (p-PCA) with a binary prior distribution. Distributions combining binary 
and continuous parts have been discussed and used as priors before (e.g., 1 1 1 1) and are 
commonly referred to as 'spike-and-slab' distributions. Also sparse coding variants with 
spike-and-slab distributions have been studied previously (compare M20I7I18I15I8I12I ). 
However, in this work we show that combining binary and Gaussian latents maintains 
the p-PCA property of having a closed-form solution for EM optimization. We can, 
therefore, derive an algorithm that uses exact posteriors with potentially many modes 
to update model parameters. 
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Fig. 1. Distributions generated by the 
GSC generative model. The left col- 
umn shows the distributions gener- 
ated for TTh ~ 1 for all h. In this 
case the model generates p-PCA dis- 
tributions. The middle column shows 
an intermediate value of tt^. The gen- 
erated distributions are not Gaussians 
anymore but have a slight star shape. 
The right column shows distributions 
for small values of -kh- The generated 
distributions have a salient star shape 
similar to standard sparse coding dis- 
tributions. 



2 The Gaussian Sparse Coding (GSC) model 

Let us first consider a pair of iJ-dimensional i.i.d. latent vectors, a continuous zeR^ 
and a binary se{0, 1}^ with: 

H 

p{s \0) = Yl nl" (1 - nhf'''^ = Bernoulli(s; tt) and p{z \0) = M{z; 0, 1^), (1) 

h = l 

where Wh parameterizes the probability of non-zero entries. After generation, both hid- 
den vectors are combined using a pointwise multiplication operator: i.e., {s z)h = 
Sfi Zfi for all h. The resulting hidden random variable is a vector of continuous values 
and zeroes, and it follows a 'spike-and-slab' distribution. Given a hidden vector (which 
we will denote by s 2;) , we generate a /^-dimensional observation y e by 
linearly combining a set of basis functions W and adding Gaussian noise: 

piy\s,z,e>)^Afiy;WisQz),S), (2) 

where W G R^^^ is the matrix containing the basis functions Wh as columns, and 
E G R^^^ is a covariance matrix parameterizing the data noise. The latents' priors ^ 
together with their pointwise combination and the noise distribution (j2]) define the gen- 
erative model under consideration. As a special case, the model contains probabilistic 
PCA (or factor analysis). This can easily be seen by setting all tt/i equal to one. 
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The model ([TJ to (j2]i is capable of generating a broad range of distributions including 
sparse coding like distributions. This is illustrated in Fig. [T] where the parameters tt^, 
allow for continuously changing PCA-like to a SC-like distribution. 

While the generative model itself has been studied previously ||20I18I15I8I . we will 
show that a closed-form EM algorithm can be derived, which can be applied to blind 
source separation tasks. We will refer to the generative model ([T]i to (|2]i as the Gaus- 
sian Sparse Coding (GSC) model in order to stress that a specific spike-and-slab prior 
(Gaussian slab) in conjunction with a Gaussian noise model is used. The GSC model 
is thus an instance of the spike-and-slab sparse coding model (or alternatively known 
sparse factor analysis models; see e.g., II20I18I15I8II ). 



2.1 Expectation Maximization (EM) for Parameter Optimization 

Consider a set of N independent data points {y^^^}n=i,....N with y*^"^ e R^. For 
these data we seek parameters — {W, 2J, tt) that maximize the data likelihood 
£ — n^=i P{y ^"^ I ^) under the GSC generative model. We employ Expectation Max- 
imization (EM) algorithm for parameter optimization. The EM algorithm (13] optimizes 
the data hkehhood w.r.t. the parameters by iteratively maximizing the free-energy 
given by: 

N r 

^(0°'^e)= E E/ log {p{y <^"^ \s,z,0))+ log {pis \0)) 

n—l s z 

+ log {p{z I 0))] dz + H{0°^'^) , (3) 

where H{0°^'^) is an entropy term only depending on parameter values held fixed dur- 
ing the optimization of T w.r.t. 0. Note that integration over the hidden space involves 
an integral over the continuous part and a sum over the binary part. 

Optimizing the free-energy consists of two steps: given the current parameters 0°^'^ 
the posterior probability is computed in the E-step; and given the posterior, J"(6)°'^,6)) 
is maximized w.r.t. in the M-step. Iteratively applying E- and M-steps locally maxi- 
mizes the data likelihood. 

M-step parameter updates: Let us first consider the maximization of the free-energy 
in the M-step before considering expectation values w.r.t. to the posterior in the E-step. 
Given a generative model, conditions for a maximum free-energy are canonically de- 
rived by setting the derivatives of T{0°^'^, 0) w.r.t. the second argument to zero. For 
the GSC model we obtain the following parameter updates: 

JV JV 

1^ - ( E y ( E J". (4) 

n—l n—l 
1 ^ 

^ - ^ E [y '"'(y *"')^ - 2(w^(« ^> J (y '"Y + w{{{sQ z){s zr)w-')] 

n=l 

1 ^ r 

^"'^ '^=^E(^)n' *here {f{s,z))^^Yl / p{s, z \ y^"\0°"') /{s, z) dz. (5) 

n=l s ■'^ 
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Equations Q to (jsll define a new set of parameter values ~ (VF, Z", tt) given the 
current values O . These 'old' parameters are only used to compute the sufficient 
statistics {s) ^, 2:)^ and ((s z){s s;)"^)^ of the model. 

Expectation Values: Although the derivation of M-step equations can be analytically 
intricate, it is the E-step that, for most generative models, poses the major challenge. 
Source of the problems involved are analytically intractable integrals required for pos- 
terior distributions and for expectation values w.r.t. the posterior. The true posterior 
is therefore often replaced by an approximate distribution (see, e.g. , [2, 16 1) or in the 
form of factored variational distributions II6I5II . The most frequently used approxima- 
tion is the maximum-a-posterior (MAP) estimate (see, e.g., ri4'91) which replaces the 
true posterior by a delta-function around the posterior's maximum value. Alternatively, 
analytically intractable expectation values are often approximated using sampling ap- 
proaches. Using approximations always implies, however, that many analytical prop- 
erties of exact EM are not maintained. Approximate EM iterations may, for instance, 
decrease the likelihood or may not recover (local or global) likelihood optima in many 
cases. There are nevertheless, a limited number of models with exact EM solutions; 
e.g., mixture models such as the mixture-of-Gaussians, p-PCA or factor analysis etc. 
Our novel work here extends the set of known models with exact EM solutions. By fol- 
lowing along the same lines as for the p-PCA derivations, we maintain in our E-step the 
analytical tractability of computing expectation values w.r.t. the posterior of the GSC 
model 

Posterior Probability: First observe that the discrete latent variable s of the GSC 
model can be directly combined with the basis functions, i.e., W{s Q z) = Wg z, 
where {Ws)dh = WdhSh- Now we apply the B ayes' rule to write down the posterior: 

e)- f^{y'''^;WsZ,s)M{z;o,iH)pis\e) 

Y.s'!^iv^''^\Ws:z',S)N{z'-Q,lH)p{s'\e)Az'' 

Note that given a state s in (j6|, the Gaussian governing the observations y is only 
dependent on the Gaussian over the continuous latent z, which is analytically indepen- 
dent of s. We can exploit this joint relation to refactorize the Gaussians. Using Gaussian 
identities the posterior can be rewritten as: 



p{s,z\y^-\0) 



AA(i/(");O,a)p(8|0)AA(z;4"\/l.) 
Y.^,N{y(-);Q.Cs>)p{s'\0) jAfiz'; K^::\As,)dz' 
= p{s\y(-\0)M{z;Ki^^\As), (7) 



where Cs = WsWj + A^ ^ (l^J IJ-^ + 1h) ^ 

and /«^") = As Wj IJ-^ y . (8) 

Equations (|7]i to ([8]l represent the crucial result for the computation of the E-step be- 
low because, first, they show that the posterior does not involve analytically intractable 
integrals and, second, for fixed s and y ^"^ the dependency on z follows a Gaussian 
distribution. This special form allows for the derivation of analytical expressions for 
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the expectation values as required for the M-step parameter updates. 

E-step Equations: Derived from (j?]), the expectation values are computed as: 

s s 

and ((s0^)(s0^)T)^^ = ^p(s|y("),0)(^ + 4")(K("))T). (10) 

Note that we have to use the cuiTent values = 0°^'^ for all parameters on the right- 
hand-side. The E-step equations Q to ( fTO] ) represent a closed-form solution for expec- 
tation values required for the closed-form M-step Q to (|5]l. 

Relation to the Mixture of Gaussians: The special form of the posterior in ([7| al- 
lows the derivation of a closed-form experession of the data likelihood: i.e., p{y \ 0) = 
Bernoulli(s; tt) J^{y 0, Cg). Note that in principle, this form can be repro- 
duced by a Gaussian mixture model. However, such a model would consist of 2^ mix- 
ture components, with strongly dependent mixing proportions and covariance matrices 
Cs - Closed-form EM-updates can in general not be derived for such dependencies. The 
standard updates for mixtures of Gaussians require independently parameterized mix- 
ing proportions and components. Therefore, the closed-form EM-solutions for the GSC 
model is not a consequence of closed-form EM for classical Gaussian mixtures. 

3 Numerical Experiments 

GSC parameter optimization is non-convex. However, as for all algorithms based on 
closed-form EM, the GSC algorithm always increases the data likelihood at least to 
a local maxima. We first numerically investigate how frequently local optima are ob- 
tained. Later we assess model's performace on more practical tasks. 

Model verification: First, we verify on artificial data that the algorithm increases the 
likelihood and that it can recover the parameters of the generating distribution. For this, 
we generated A'^ = 500 data points y from the GSC generative model ([T]l to (j2]| with 
D — H — 2. We used randomly initialized generative parameters[^ The algorithm was 
run 250 times on the generated data. For each run we performed 300 EM iterations. 
For each run, we randomly and uniformly initialized tt/j between 0.05 and 10, set S to 
the covariance accross the data points, and the elements of W we chose to be indepen- 
dently drawn from a normal distribution with zero mean and unit variance. In all runs 
the generating parameter values were recovered with high accuracy. Runs with different 
generating parameters produced essentially the same results. 

Recovery of sparse directions: To test the model's robustness w.r.t. a relaxation of the 
GSC assumptions, we applied the GSC algorithm to data generated by standard sparse 
coding models. We used a standard Cauchy prior and a Gaussian noise model lfT4l for 

' We obtained W^'^'^ by independently drawing each matrix entry from a normal distribution 
with zero mean and standard deviation 3. 7r^°" values were drawn from a uniform distribution 
between 0.05 and 1, X' = a^^'^ln (where a^"^" was uniformly drawn between 0.05 and 10). 
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Fig. 2. Histogram of hkelihood 
values for 100 runs of the GSC 
algorithm on data generated by 
a SC model with Cauchy prior. 
Almost all runs converged to 
high likelihood values. 

iog(£(e)) 



data generation. Fig.[3]second panel shows data generated by this sparse coding model 
while the first panel shows the prior density along one of its hidden dimensions. We 
generated N — 500 data points with H = D = 2. We then applied the GSC algorithm 
with the same parameter initialization as in the previous experiment. We performed 100 
trials using 300 EM iterations per trial. Again, the algorithm converged to high likeli- 
hood values in most runs (see Fig. [2]). As a performace measure for this experiment we 
investigated how well the heavy tails (i.e., the sparse directions) of standard SC were 
recovered. As a performance metric, we used the Amari index [IJ: 



cauchy prior generated sample spike-and-slap prior learned GSC distribution 




Fig. 3. Comparison of standard sparse coding and GSC. Left panels: Cauchy distribution (along 
one hidden dimension) as a standard SC prior 1141 and data generated by it. Right panels: Spike- 
and-slab distribution (one of the hidden dimensions) inferred by the GSC algorithm along with 
inferred sparse directions (solid red lines) and posterior data density contours (dotted red lines). 



A(w\ _ 1 / \Ohh'\ I \Ohh'\ \ L nn 

^^^^ > - 2H{H-1) 2^hM = l Vmax^,, |0^,.„| + maX;,,, \0,,„^,\) H-l "^^^^ 

where Ohw '■— mean Amari index of all runs with high like- 

lihood values was below lO^'^, which shows a very accurate recovery of the sparse 
directions. Fig. [3] (right panel) visualizes the distribution recovered by the GSC algo- 
rithm in a typical run. The dotted red lines show the density contours of the learned 
distribution p{y \ 0). High accuracy in the recovery of the generating sparse directions 
(solid black lines) can be observed by comparison with the recovered directions (solid 
red lines). The results of experiments are qualitatively the same if we increase the num- 
ber of hidden and observed dimensions; e.g., for H = D = Awe found the algorithm 
converged to a high likelihood in 91 (with average Amari index below 10^^) of 100 
runs. 

Other than standard SC with Cauchy prior, we also ran the algorithm on data gen- 
erated by SC with Laplace prior 1 14 9|. There for H — D — 2, we converged to high 
likelihood values in 99 of 100 runs with an average Amari index 0.06. In the experiment 
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Table 1. Performance of different algorithms on benchmarks for source separation. Data for NG- 
LICA, KICA, PICA, and JADE are taken from 1 17|. Performances are compared based on the 
Amari index Bold values highlight the best performing algorithm(s). 



datasets 


Amari index (standard deviation) 


name 


N 


GSC GSC^ NG-LICA KICA PICA JADE 


lOhalo 


200 
500 


0.34(0.05) 0.29(0.03) 0.29(0.02) 0.38(0.03) 0.33(0.07) 0.36(0.00) 
0.27(0.01) 0.27(0.01) 0.22(0.02) 0.37(0.03) 0.22(0.03) 0.28(0.00) 


Sergio7 


200 
500 


0.23(0.06) 0.20(0.06) 0.04(0.01) 0.38(0.04) 0.05(0.02) 0.07(0.00) 
0.18(0.05) 0.17(0.03) 0.05(0.02) 0.37(0.03) 0.04(0.01) 0.04(0.00) 


Speech4 


200 
500 


0.25(0.05) 0.17(0.04) 0.18(0.03) 0.29(0.05) 0.20(0.03) 0.22(0.00) 
0.11(0.04) 0.05(0.01) 0.07(0.00) 0.10(0.04) 0.10(0.04) 0.06(0.00) 


c5 signals 


200 
500 


0.39(0.03) 0.44(0.05) 0.12(0.01) 0.25(0.15) 0.10(0.02) 0.12(0.00) 
0.41(0.05) 0.44(0.04) 0.06(0.04) 0.07(0.06) 0.04(0.02) 0.07(0.00) 



with H — D — A the algorithm converged to a high likelihood in 97 of 100 runs. The 
average Amari index of all runs with high likelihoods was 0.07 in this case. 

Source separation: We applied the GSC algorithm to publicly available benchmarks. 
We used the non-artificial benchmarks of |17|. The datasets mainly contain acoustic 
data obtained from ICALAB |3 1. We generated the observed data by mixing the bench- 
mark sources using randomly generated orthogonal mixing matrix (we followed ITtI ). 



Again, the Amari index (111 was used as a performace measure. 




Fig. 4. Histogram of the deviation from or- 
thogonality of the W matrix for 100 runs of 
the GSC algorithm on the Speech4 bench- 
mark (A'^ — 500). A clear cluster of the 
most orthogonal runs can automatically be 
detected: the threshold of runs considered is 
defined to be the minimum after the cluster 
(black arrow). 



For all the benchmarks we used N ~ 200 and N = 500 data points (as selected 
bv lflTl ). We applied GSC to the data using the same initialization as described before. 
For each experiment we performed 100 trials with a random new parameter initializa- 
tion per trial. The first column of Tab. 1 list average Amari indices obtained including 
all trials per experimenj^ It is important to note that all the other algorithms listed in 
the comparison assume orthogonal mixing matrices, while the GSC algorithm does not. 
Therefore in the column 'GSC^' in Tab. 1, we report statistics that are only computed 
over the runs which infered the most orthogonal W matrices. As a measure of orthogo- 
nality we used the maximal deviation from 90° between any two axes. Fig.|4]shows as 
an example a histogram of the maximal deviations of all trials on the Speech4 data 
with N = 500. As can be observed, we obtained a clear cluster of runs with high or- 
thogonality. We observed worst performace of the GSC algorithm on the cSsignals 



^ We obtained the reported results by diagonalizing the updated S in the M-step by setting 
S = a^lD, where cr^ = Tr(E)/D. 
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dataset. However, the dataset contains sub-Gaussian sources which in general can not 
be recovered by sparse coding approaches. 



4 Discussion 

The GSC algorithm falls into the class of standard SC algorithms. However, instead of 
a heavy-tail prior, it uses a spike-and-slab distribution. The algorithm has a distinguish- 
ing capability of taking the whole (potentially a multimodal) posterior into account for 
parameter optimization, which is in contrast to the MAP approximation of the poste- 
rior, which is a widely used approach for training SC models (see, e.g., |9 10 1) . Various 
sophisticated methods have been proposed to efficently find the MAP (e.g., 1 19J). MAP 
based optimizations usually also require regularization parameters that have to be in- 
ferred (e.g., by means of cross-validation). Other approximations that take more aspects 
of the posterior into account are also being investigated actively (e.g., I116I12II '). How- 
ever, approximations can introduce learning biases. For instance, MAP and Laplace 
approximations assume monomodality in posterior estimation, which is not always the 
case. 

Closed-form EM learning of the GSC algorithm also comes with a cost, which is 
exponential w.r.t. the number of hidden dimensions H. This can be seen by considering 
([8]l where the partition function requires a summation over all binary vectors s (similar 
for expectation values w.rt. the posterior). Nevertheless, we show in numerical exper- 
iments that the algorithm is well applicable to the typical range of source separation 
tasks. In such domains the GSC algorithm can benefit from taking potentially a multi- 
modal posterior into acount and infering a whole set of model parameters including the 
sparsitiy per latent dimension. For instance, when using a number of hidden dimensions 
larger than the number of actual sources, the model can discard dimensions by setting 
TT/i parameters to zero. The studied model could thus be considered as treating parame- 
ter inference in a more Bayesian way than, e.g., SC with MAP estimates (compare ||9|). 
The second line of research aims at a fully Bayesian description of sparse coding and 
emphasises a large flexibility using estimations of the number of hidden dimensions 
and by being applicable with a range of different noise models. The great challenge of 
these general models is the procedure of parameter estimation. For the model in lfT2ll . 
for instance, the Bayesian methodology involves conjugate priors and hyperparameters 
in combination with Laplace approximation and different sampling schemes. While the 
aim, e.g., in I 20I7I18I15I8I12 I is a large flexibility, we aim at a generalization of sparse 
coding that maintains the closed-form EM solutions. 

To summarize, we have studied a novel sparse coding algorithm and have shown its 
competitiveness on source separation benchmarks. Along with the reported results on 
source separation, the main contribution of this work is the derivation and numerical 
investigation of the (to the knowledge of the authors) first closed-form, exact EM algo- 
rithm for spkie-and-slab sparse coding. 
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